# Make sure, that StringsAsFactors = FALSE
options(stringsAsFactors = F)
if(!require(install.load)){
install.packages("install.load")
library(install.load)
}
install_load("tidyverse", "moments", "plotly", "data.table", "fitdistrplus")
library(tidyverse)
library(moments)
library(plotly)
library(data.table)
library(fitdistrplus)
In this task, we have to create a distribution of logistic delay of
component K7. Two data sets were provided, which are
Komponente_K7.csv, which contains production date, and
Logistikverzug_K7.csv, which contains incoming date of the
product. To start analyzing the data, we must import the data sets and
set their column types accordingly.
# For semicolon delimited
K7 <- read_delim("Data/Logistikverzug/Komponente_K7.csv",delim = ";")[2:6]
K7$Fehlerhaft <- as.logical(K7$Fehlerhaft)
K7$Produktionsdatum <- as.Date(K7$Produktionsdatum,
format = "%Y-%m-%d")
# For comma delimited
LK7 <- read_delim("Data/Logistikverzug/Logistikverzug_K7.csv",delim = "," )[2:6]
LK7$Fehlerhaft <- as.logical(LK7$Fehlerhaft)
LK7$Wareneingang <- as.Date(LK7$Wareneingang,
format="%Y-%m-%d")
After importing the data, we check if the column Fehlerhaft in both data sets are equal.
Result<- K7$Fehlerhaft == LK7$Fehlerhaft
Result[FALSE]
## logical(0)
Based on this result, the column Fehlerhaft are in both data sets
equal. Now, both data sets K7 and LK7 can be combined by using
full_join(). A column with the name Wochentag is created
for the day of production date. Afterwards, all NA values in the column
Produktionsdatum and Wareneingang are filtered. A column Warenausgang is
created, if the day of production date is Friday and Saturday. If the
production day is Friday and Saturday, the product can be sent only on
Monday, hence the production date is added by 3 or 2 respectively. For
all other days, this date is added by 1.
logistics_delay <- K7 %>%
full_join(LK7, by = c("IDNummer","Herstellernummer", "Werksnummer","Fehlerhaft")) %>%
mutate(Wochentag = weekdays(Produktionsdatum)) %>% # Creates a new column Wochentag
filter(!is.na(Produktionsdatum)) %>% # Filter NA values
filter(!is.na(Wareneingang)) %>% # Filter NA values
mutate(Warenausgang = ifelse(Wochentag == "Freitag",
Produktionsdatum +3,
ifelse(Wochentag == "Samstag",
Produktionsdatum +2,
Produktionsdatum +1)))
The column Warenausgang can now be formatted in date-format. Logistic delay can then be calculated by substracting the incoming date (Wareneingang) with the outgoing date (Warenausgang).
logistics_delay <- logistics_delay %>%
mutate(Warenausgang = as.Date(Warenausgang,
origin = "1970-01-01",
format="%Y-%m-%d")) %>%
mutate(Verzug = Wareneingang - Warenausgang)
To determine the distribution of the logistic delay, we use the
function descdist from the package
fitdistrplus. To use this, the column Verzug needs to be
converted in numeric. Afterwards, the column is analyzed for discrete
distribution, since the data contains integer values.
# Convert Verzug column in numeric
logistics_delay$Verzug <- as.numeric(logistics_delay$Verzug)
# Determine the logistics distribution
descdist(logistics_delay$Verzug, discrete = TRUE)
## summary statistics
## ------
## min: 1 max: 14
## median: 6
## mean: 5.652041
## estimated sd: 1.246412
## estimated skewness: 0.03616141
## estimated kurtosis: 3.305384
According to the Cullen and Frey graph, the logistics data follows negative binomial distribution. Based on the fact that the skewness value is positive and the mean value is slightly larger than median, the distribution is positively skewed (Source: https://www.vrcbuzz.com/karl-pearsons-coefficient-of-skewness-using-r-with-examples/).
The skewness value is to analyze the symmetrie or the lack of symmetrie of the distribution. In this case it is possible to observe that this is negative. The value is -0,84. This means that the tail is on the left side of the distribution. The kurtosis value is greather than 3, it is said to be leptokurtic.This means that there are more chances to be outliers and that the distribution is peakead and haslong tails (thick)
summary(logistics_delay$Verzug)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.000 6.000 5.652 6.000 14.000
#skew(logistics_delay$Verzug)
mean <- mean(logistics_delay$Verzug)
median <- median(logistics_delay$Verzug)
stdev <- sd(logistics_delay$Verzug)
skewness <-3*(mean-median)/stdev #Pearson’s Coefficient of Skewness using the median
kurtosis<-kurtosis(logistics_delay$Verzug)
cat("skewness: ",skewness)
## skewness: -0.8375069
cat("kurtosis: ",kurtosis)
## kurtosis: 3.30536
#https://www.vrcbuzz.com/karl-pearsons-coefficient-of-skewness-using-r-with-examples/
To calculate the mean, following code is implemented.
mean <- as.numeric(mean(logistics_delay$Verzug))
# Ausgeben der Ergebnisse
cat("Mean = ", mean)
## Mean = 5.652041
binwith <-(14-1)/14
logistics_delay <-ggplot(logistics_delay, aes(x = Verzug ))+
geom_histogram(aes(y = stat(density)),colour="black", fill="white", binwidth=1)+
scale_x_continuous(breaks = c(0:14))+
geom_density( fill="#FF6666",adjust = 10,alpha = 0.5)
ggplotly(logistics_delay)
To create a decision tree, its important to know what and how many attributes there are, that influence the defect of component K7. Based on the data set K7, there are four columns that could be relevant. These are IDNummer, Produktionsdatum, Herstellernummer and Werksnummer. It could be noticed, that the IDNummer, Hersteller, and Werksnummer contain similar information. Herstellernummer and Werksnummer are directly connected, hence both should not be considered as different attribute. Inside the IDNummer, Herstellernummer and Werksnummer are also included and followed by the serie number of production. Hence, the IDNummer should also not be included as a different attribute. Based on this, the defect of component K7 is attributed solely from Produktionsdatum and Herstellernummer. The data set could be summarized as follow:
K7_decision_tree <- K7 %>%
dplyr::select(Produktionsdatum, Herstellernummer, Fehlerhaft) %>%
arrange(desc(Fehlerhaft))
head(K7_decision_tree)
## # A tibble: 6 × 3
## Produktionsdatum Herstellernummer Fehlerhaft
## <date> <dbl> <lgl>
## 1 2011-02-02 114 TRUE
## 2 2011-06-02 114 TRUE
## 3 2013-10-29 112 TRUE
## 4 2014-01-14 112 TRUE
## 5 2014-06-02 112 TRUE
## 6 2014-11-20 112 TRUE
tail(K7_decision_tree)
## # A tibble: 6 × 3
## Produktionsdatum Herstellernummer Fehlerhaft
## <date> <dbl> <lgl>
## 1 2016-11-12 112 FALSE
## 2 2016-11-12 112 FALSE
## 3 2016-11-12 112 FALSE
## 4 2016-11-12 112 FALSE
## 5 2016-11-12 112 FALSE
## 6 2016-11-13 112 FALSE
The next step would be to determine, which of the attributes can be considered as root characteristics. To do this, we have to measure Information Gain value \(IG(D,A)\) by applying each attribute A of both Produktionsdatum and Herstellernummer to the destination attribute D, which is in this case Fehlerhaft . This can be calculated by substracting \(H(D)\), which is the entropy of the datset of destination attribute D, with \(Rest(A)\), which is the remaining entropy that is still there after testing attribute A.
To calculate entropy of the dataset D, following equation is used,
whereby the variable \(q\) describes
the proportion of defective components to the total number of
components:
Afterwards, we can calculate the remaining entropy of each attributes (Produktionsdatum and Herstellernummer) by using following equation:
For this, \(p_{k}\) and \(n_{k}\) corresponds with the number of
defective and non-defective components in correlation with each
component of corresponding attributes. \(q_{k}\) can be calculated by dividing \(p_{k}\) with the sum of \(p_{k}\) and \(n_{k}\). By using this equation, we can
calculate the remaining entropy for both Produktionsdatum - \(Rest(Produktionsdatum)\) - and
Herstellersnummer - \(Rest(Herstellersnummer)\) and hence
determine the Information Gain for both \(IG(D, Produktionsdatum\) and \(IG(D, Produktionsdatum)\). The attribute
with the largest information gain would then be selected as the root
attribute and the other as the sub-attribute. The decision tree may look
as follow.
It is reasonable to save data in separate files instead of one huge table, because:
#reading zulassung file
zulassung <- read_delim("Data/Zulassungen/Zulassungen_alle_Fahrzeuge.csv",delim = ";")
## New names:
## Rows: 3204104 Columns: 4
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: ";" chr
## (2): IDNummer, Gemeinden dbl (1): ...1 date (1): Zulassung
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
#filtering Zulassung
zulassung_adel=zulassung%>%filter(Gemeinden =="ADELSHOFEN")
#reading T16 Teil Komponente
#for semicolon delimited
T16_Kom_1 <- read_delim("Data/Komponente/Bestandteile_Komponente_K2LE2.csv",delim =";")%>%dplyr::select("ID_T16","ID_K2LE2")
## New names:
## Rows: 163769 Columns: 5
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: ";" chr
## (4): ID_T16, ID_T19, ID_T20, ID_K2LE2 dbl (1): ...1
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
names(T16_Kom_1)[2]<-"ID_Sitze"
T16_Kom_2 <- read_delim("Data/Komponente/Bestandteile_Komponente_K2ST2.csv",delim =";")%>%dplyr::select("ID_T16","ID_K2ST2")
## New names:
## Rows: 655075 Columns: 5
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: ";" chr
## (4): ID_T16, ID_T17, ID_T18, ID_K2ST2 dbl (1): ...1
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
names(T16_Kom_2)[2]<-"ID_Sitze"
Kom_K2LE2<- read.delim("Data/Komponente/Komponente_K2LE2.txt",sep = "\\")%>%dplyr::select(c("ID_Sitze","Werksnummer")) # not need to read it doesnt give any extra info
Kom_K2ST2<-read_delim("Data/Komponente/Komponente_K2ST2.csv",delim = ";")%>%dplyr::select(c("ID_Sitze","Werksnummer"))
## New names:
## Rows: 655075 Columns: 10
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: ";" chr
## (2): ID_Sitze, origin dbl (7): ...1, X1, Herstellernummer, Werksnummer,
## Fehlerhaft, Fehlerhaft_Fa... date (1): Fehlerhaft_Datum
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
T16_K2LE2<-T16_Kom_1%>%left_join(Kom_K2LE2,by="ID_Sitze")
T16_K2ST2<-T16_Kom_2%>%left_join(Kom_K2ST2,by="ID_Sitze")
T16_Kom_bind<-rbind(T16_K2LE2,T16_K2ST2)
#read Bestandteil Fahrzeuge and combine them
Kom_Fahr_1<-read_delim("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM2_Typ21.csv",delim = ";")%>%dplyr::select(c("ID_Sitze","ID_Fahrzeug"))
## New names:
## Rows: 512354 Columns: 6
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: ";" chr
## (5): ID_Karosserie, ID_Schaltung, ID_Sitze, ID_Motor, ID_Fahrzeug dbl (1): ...1
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
Kom_Fahr_2<-read_delim("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM2_Typ22.csv",delim =";")%>%dplyr::select(c("ID_Sitze","ID_Fahrzeug"))
## Rows: 306490 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (5): ID_Karosserie, ID_Schaltung, ID_Sitze, ID_Motor, ID_Fahrzeug
## dbl (1): X1
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Kom_Fahr_bind<-rbind(Kom_Fahr_1,Kom_Fahr_2)
#join the different datasets to create a dataset that only contains the IDs of the components that where produced in ADELSHOFEN
Kom_Fahr <- T16_Kom_bind %>%left_join(Kom_Fahr_bind, by ="ID_Sitze")
Kom_Fahr_Adel <-Kom_Fahr%>%inner_join(zulassung_adel,by =c("ID_Fahrzeug"="IDNummer"))
Number of Cars register in ADELSHOFEN:
cat("Cars registered at Adelshofen",nrow(Kom_Fahr_Adel))
## Cars registered at Adelshofen 48
The Data type of the table “Zulassung” is shown with the str() function
str(zulassung)
## spec_tbl_df [3,204,104 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ...1 : num [1:3204104] 408097 408098 1 2 3 ...
## $ IDNummer : chr [1:3204104] "11-1-11-1" "11-1-11-2" "12-1-12-1" "12-1-12-2" ...
## $ Gemeinden: chr [1:3204104] "DRESDEN" "DRESDEN" "LEIPZIG" "LEIPZIG" ...
## $ Zulassung: Date[1:3204104], format: "2009-01-01" "2009-01-01" ...
## - attr(*, "spec")=
## .. cols(
## .. ...1 = col_double(),
## .. IDNummer = col_character(),
## .. Gemeinden = col_character(),
## .. Zulassung = col_date(format = "")
## .. )
## - attr(*, "problems")=<externalptr>
To Save the records on the database of a server has the advantage that no local memory is occupied but more important, a permanent access to the most recently data records can be in most cases guaranteed, with a few exception as for example when maintenance takes place.
Since databases tend to store a large amount of data, this could cause the computer to slow down. Also the quality of the data deteriorate very fast, since the data doesn’t get updated.
“RStudio Shiny Server Open Source” allows customers to get access to the app from anywhere with any web browser
In order to find out where car with the carrosserie “K5-112-1122-79” got registered, it is important to know the car ID. This is possible by reading the file ” Bestandteile Fahrzeuge”
car<-read_delim("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM1_Typ12.csv",delim =";")%>%dplyr::select(c("ID_Karosserie","ID_Fahrzeug"))%>%filter(ID_Karosserie == "K5-112-1122-79")
## New names:
## Rows: 408096 Columns: 6
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: ";" chr
## (5): ID_Karosserie, ID_Schaltung, ID_Sitze, ID_Motor, ID_Fahrzeug dbl (1): ...1
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
car_registerd<-filter(zulassung, IDNummer == car$ID_Fahrzeug)
gemeinden <- as.character(car_registerd$Gemeinden)
cat("Car with ID Karosserie'K5-112-1122-79' got registered in ",
gemeinden)
## Car with ID Karosserie'K5-112-1122-79' got registered in ASCHERSLEBEN